The Arc Prize Foundation, co-founded by AI researcher François Chollet, has unveiled ARC-AGI-2, a new, ultra-challenging benchmark designed to test AI models’ true general intelligence. The results? Most leading models have failed spectacularly.
According to the Arc Prize leaderboard, top reasoning models like OpenAI’s o1-pro and DeepSeek’s R1 barely managed 1%-1.3% accuracy. Other powerful models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, scored around 1%.
In contrast, human participants averaged 60% accuracy, outperforming every AI model by a wide margin.
Unlike traditional AI benchmarks, ARC-AGI-2 presents adaptive pattern-recognition challenges that models must solve without prior exposure; a minimal sketch of the task format appears after the list below. Compared with its predecessor, ARC-AGI-1, the new test:
🔹 Prevents brute-force computing from inflating scores
🔹 Emphasizes efficiency, measuring how AI acquires and applies new skills
🔹 Eliminates reliance on memorization, forcing models to think on the fly
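To make "pattern-recognition challenges" concrete, here is a minimal toy sketch, assuming ARC-AGI-2 keeps the grid-based format of the original ARC dataset: each task provides a few input/output demonstration grids (integers encoding colors), and the model must infer the hidden rule and apply it to a test input it has never seen. The task, rule, and solver below are purely illustrative, not drawn from the actual benchmark.

```python
from typing import List

Grid = List[List[int]]

# Hypothetical ARC-style task; the hidden rule here is "mirror each row left-right".
toy_task = {
    "train": [  # demonstration pairs the model can study
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
    ],
    "test": [   # the pair the model is actually scored on
        {"input": [[0, 5], [0, 7]], "output": [[5, 0], [7, 0]]},
    ],
}

def score(predicted: Grid, expected: Grid) -> bool:
    """Scoring is exact match: every cell of the output grid must be correct."""
    return predicted == expected

# A trivial solver for this one toy rule (each real task hides a different rule,
# which is why memorization does not help).
def mirror(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]

test_pair = toy_task["test"][0]
print(score(mirror(test_pair["input"]), test_pair["output"]))  # True
```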
Even OpenAI’s o3 (low) model, which dominated ARC-AGI-1 with 75.7% accuracy, scored just 4% on ARC-AGI-2, and only after spending about $200 in compute per task.
Tech leaders, including Hugging Face’s Thomas Wolf, argue that AI development lacks rigorous benchmarks for creativity and general intelligence. ARC-AGI-2 may be the first real step in addressing that gap.
To push AI research forward, the Arc Prize Foundation has launched the Arc Prize 2025 challenge, offering a prize to any model that reaches 85% accuracy on ARC-AGI-2 while spending no more than $0.42 per task.
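Those two per-task cost figures give a rough sense of the efficiency gap the prize targets. The sketch below is back-of-the-envelope arithmetic using only the numbers quoted in this article, not official prize math:

```python
# Efficiency gap implied by the figures above (both are per-task costs in USD).
o3_low_cost_per_task = 200.00   # reported compute cost of o3 (low) on ARC-AGI-2
prize_cost_cap       = 0.42     # Arc Prize 2025 spending limit per task

print(f"Required cost reduction: ~{o3_low_cost_per_task / prize_cost_cap:.0f}x")
# -> Required cost reduction: ~476x, on top of lifting accuracy from 4% to 85%
```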
With AI models struggling to pass this test, the question looms: How far are we from real artificial general intelligence?